{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Discrete Choice Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fair's Affair data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A survey of women only was conducted in 1974 by *Redbook* asking about extramarital affairs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import statsmodels.api as sm\n",
    "from scipy import stats\n",
    "from statsmodels.formula.api import logit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(sm.datasets.fair.SOURCE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(sm.datasets.fair.NOTE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dta = sm.datasets.fair.load_pandas().data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dta[\"affair\"] = (dta[\"affairs\"] > 0).astype(float)\n",
    "print(dta.head(10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dta.describe())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "affair_mod = logit(\n",
    "    \"affair ~ occupation + educ + occupation_husb\"\n",
    "    \"+ rate_marriage + age + yrs_married + children\"\n",
    "    \" + religious\",\n",
    "    dta,\n",
    ").fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(affair_mod.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How well are we predicting?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "affair_mod.pred_table()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The coefficients of the discrete choice model do not tell us much. What we're after is marginal effects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mfx = affair_mod.get_margeff()\n",
    "print(mfx.summary())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "respondent1000 = dta.iloc[1000]\n",
    "print(respondent1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "resp = dict(\n",
    "    zip(\n",
    "        range(1, 9),\n",
    "        respondent1000[\n",
    "            [\n",
    "                \"occupation\",\n",
    "                \"educ\",\n",
    "                \"occupation_husb\",\n",
    "                \"rate_marriage\",\n",
    "                \"age\",\n",
    "                \"yrs_married\",\n",
    "                \"children\",\n",
    "                \"religious\",\n",
    "            ]\n",
    "        ].tolist(),\n",
    "    )\n",
    ")\n",
    "resp.update({0: 1})\n",
    "print(resp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mfx = affair_mod.get_margeff(atexog=resp)\n",
    "print(mfx.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`predict` expects a `DataFrame` since `patsy` is used to select columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "respondent1000 = dta.iloc[[1000]]\n",
    "affair_mod.predict(respondent1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "affair_mod.fittedvalues[1000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "affair_mod.model.cdf(affair_mod.fittedvalues[1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The \"correct\" model here is likely the Tobit model. We have an work in progress branch \"tobit-model\" on github, if anyone is interested in censored regression models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise: Logit vs Probit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 8))\n",
    "ax = fig.add_subplot(111)\n",
    "support = np.linspace(-6, 6, 1000)\n",
    "ax.plot(support, stats.logistic.cdf(support), \"r-\", label=\"Logistic\")\n",
    "ax.plot(support, stats.norm.cdf(support), label=\"Probit\")\n",
    "ax.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 8))\n",
    "ax = fig.add_subplot(111)\n",
    "support = np.linspace(-6, 6, 1000)\n",
    "ax.plot(support, stats.logistic.pdf(support), \"r-\", label=\"Logistic\")\n",
    "ax.plot(support, stats.norm.pdf(support), label=\"Probit\")\n",
    "ax.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compare the estimates of the Logit Fair model above to a Probit model. Does the prediction table look better? Much difference in marginal effects?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generalized Linear Model Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(sm.datasets.star98.SOURCE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(sm.datasets.star98.DESCRLONG)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(sm.datasets.star98.NOTE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dta = sm.datasets.star98.load_pandas().data\n",
    "print(dta.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    dta[\n",
    "        [\"NABOVE\", \"NBELOW\", \"LOWINC\", \"PERASIAN\", \"PERBLACK\", \"PERHISP\", \"PERMINTE\"]\n",
    "    ].head(10)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    dta[\n",
    "        [\"AVYRSEXP\", \"AVSALK\", \"PERSPENK\", \"PTRATIO\", \"PCTAF\", \"PCTCHRT\", \"PCTYRRND\"]\n",
    "    ].head(10)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "formula = \"NABOVE + NBELOW ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT \"\n",
    "formula += \"+ PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Aside: Binomial distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Toss a six-sided die 5 times, what's the probability of exactly 2 fours?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stats.binom(5, 1.0 / 6).pmf(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.special import comb\n",
    "\n",
    "comb(5, 2) * (1 / 6.0) ** 2 * (5 / 6.0) ** 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.formula.api import glm\n",
    "\n",
    "glm_mod = glm(formula, dta, family=sm.families.Binomial()).fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(glm_mod.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of trials "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "glm_mod.model.data.orig_endog.sum(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "glm_mod.fittedvalues * glm_mod.model.data.orig_endog.sum(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact\n",
    "on the response variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exog = glm_mod.model.data.orig_exog  # get the dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "means25 = exog.mean()\n",
    "print(means25)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "means25[\"LOWINC\"] = exog[\"LOWINC\"].quantile(0.25)\n",
    "print(means25)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "means75 = exog.mean()\n",
    "means75[\"LOWINC\"] = exog[\"LOWINC\"].quantile(0.75)\n",
    "print(means75)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, `predict` expects a `DataFrame` since `patsy` is used to select columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "resp25 = glm_mod.predict(pd.DataFrame(means25).T)\n",
    "resp75 = glm_mod.predict(pd.DataFrame(means75).T)\n",
    "diff = resp75 - resp25"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The interquartile first difference for the percentage of low income households in a school district is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"%2.4f%%\" % (diff[0] * 100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nobs = glm_mod.nobs\n",
    "y = glm_mod.model.endog\n",
    "yhat = glm_mod.mu"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.graphics.api import abline_plot\n",
    "\n",
    "fig = plt.figure(figsize=(12, 8))\n",
    "ax = fig.add_subplot(111, ylabel=\"Observed Values\", xlabel=\"Fitted Values\")\n",
    "ax.scatter(yhat, y)\n",
    "y_vs_yhat = sm.OLS(y, sm.add_constant(yhat, prepend=True)).fit()\n",
    "fig = abline_plot(model_results=y_vs_yhat, ax=ax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Plot fitted values vs Pearson residuals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pearson residuals are defined to be\n",
    "\n",
    "$$\\frac{(y - \\mu)}{\\sqrt{(var(\\mu))}}$$\n",
    "\n",
    "where var is typically determined by the family. E.g., binomial variance is $np(1 - p)$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 8))\n",
    "ax = fig.add_subplot(\n",
    "    111,\n",
    "    title=\"Residual Dependence Plot\",\n",
    "    xlabel=\"Fitted Values\",\n",
    "    ylabel=\"Pearson Residuals\",\n",
    ")\n",
    "ax.scatter(yhat, stats.zscore(glm_mod.resid_pearson))\n",
    "ax.axis(\"tight\")\n",
    "ax.plot([0.0, 1.0], [0.0, 0.0], \"k-\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Histogram of standardized deviance residuals with Kernel Density Estimate overlaid"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The definition of the deviance residuals depends on the family. For the Binomial distribution this is\n",
    "\n",
    "$$r_{dev} = sign\\left(Y-\\mu\\right)*\\sqrt{2n(Y\\log\\frac{Y}{\\mu}+(1-Y)\\log\\frac{(1-Y)}{(1-\\mu)}}$$\n",
    "\n",
    "They can be used to detect ill-fitting covariates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "resid = glm_mod.resid_deviance\n",
    "resid_std = stats.zscore(resid)\n",
    "kde_resid = sm.nonparametric.KDEUnivariate(resid_std)\n",
    "kde_resid.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 8))\n",
    "ax = fig.add_subplot(111, title=\"Standardized Deviance Residuals\")\n",
    "ax.hist(resid_std, bins=25, density=True)\n",
    "ax.plot(kde_resid.support, kde_resid.density, \"r\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### QQ-plot of deviance residuals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(12, 8))\n",
    "ax = fig.add_subplot(111)\n",
    "fig = sm.graphics.qqplot(resid, line=\"r\", ax=ax)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
